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Abstract 

The  behavior  of  conventional  Krylov  Subspace  Methods  (KSMs)  in  finite  precision  arithmetic  is  a  well-studied 
problem.  The  finite  precision  Lanczos  process,  which  drives  convergence  of  these  methods,  can  lead  to  a  significant 
deviation  between  the  recursively  computed  residual  and  the  true  residual,  b  —  Axk ,  decreasing  the  maximum 
attainable  accuracy  of  the  solution.  Van  der  Vorst  and  Ye  [24]  have  advocated  the  use  of  a  residual  replacement 
strategy  for  KSMs  to  prevent  the  accumulation  of  this  error,  in  which  the  computed  residual  is  replaced  by  the 
true  residual  at  specific  iterations  chosen  such  that  the  Lanczos  process  is  undisturbed. 

Recent  results  have  demonstrated  the  performance  benefits  of  Communication- Avoiding  Krylov  Subspace 
Methods  (CA-KSMs),  variants  of  KSMs  which  use  blocking  strategies  to  perform  s  computation  steps  of  the 
algorithm  for  each  communication  step.  This  allows  an  O(s)  reduction  in  total  communication  cost,  which  can 
lead  to  significant  speedups  on  modern  computer  architectures.  Despite  the  potential  performance  benefits  of 
CA-KSMs,  the  finite  precision  error  in  these  variants  grows  with  s,  an  obstacle  for  their  use  in  practice. 

Following  the  work  of  Van  der  Vorst  and  Ye  ,  we  bound  the  deviation  of  the  true  and  computed  residual  in  finite 
precision  CA-KSMs,  which  leads  to  an  implicit  residual  replacement  strategy.  We  are  the  first,  to  our  knowledge, 
to  perform  this  analysis  for  CA-KSMs.  We  show  how  to  implement  our  strategy  without  affecting  the  asymptotic 
communication  or  computation  cost  of  the  algorithm.  Numerical  experiments  demonstrate  the  effectiveness  of 
our  residual  replacement  strategy  for  both  CA-CG  and  CA-BICG.  Specifically,  it  is  shown  that  accuracy  of  order 
0(e)||A||  ■  ||a;||  can  be  achieved  with  a  small  number  of  residual  replacement  steps  for  an  appropriately  chosen 
polynomial  basis,  which  demonstrates  the  potential  for  practical  use  of  CA-KSMs. 


1  Introduction 

Krylov  subspace  methods  (KSMs)  are  a  class  of  iterative  algorithms  commonly  used  to  solve  linear  systems.  These 
methods  work  by  iteratively  adding  a  dimension  to  a  Krylov  subspace  and  then  choosing  the  “best”  solution  from 
the  resulting  subspace.  In  terms  of  linear  algebra,  these  operations  consist  of  one  or  more  sparse  matrix-vector 
multiplications  (SpM\Ts)  and  vector  operations  in  each  iteration,  where  the  solution  xk  and  residual  rk  are  updated 
as 


xk  =  x*-1  +  a^p^1  rk  =  r*-1  -  ak-iAp*-1  (1) 

or  something  similar.  This  encompasses  algorithms  such  as  Conjugate  Gradient  (CG),  steepest  descent,  Biconjugate 
Gradient  (BICG),  Conjugate  Gradient  Squared  (CGS),  and  Stabilized  Biconjugate  Gradient  (BICGSTAB). 

It  is  important  to  notice  that  xk  and  rk  have  different  round-off  patterns.  That  is,  the  expression  for  xk  does 
not  depend  on  rk,  nor  does  the  expression  for  rk  depend  on  xk .  Therefore,  computational  errors  made  in  xk  are  not 
self-correcting.  Throughout  the  iteration,  these  errors  accumulate,  and  cause  deviation  of  the  true  residual,  b  —  Axk, 
and  computed  residual,  rk .  This  limits  the  maximum  attainable  accuracy,  which  indicates  how  accurately  we  can 
solve  the  system  on  a  computer  with  machine  precision  e.  When  the  algorithm  reaches  this  maximum  attainable 
accuracy,  the  computed  residual  will  appear  to  continue  decreasing  in  norm,  whereas  the  norm  of  the  true  residual 
stagnates.  This  can  lead  to  a  very  large  error  in  the  solution  despite  the  algorithm  reporting  a  very  small  residual 
norm. 

This  has  motivated  the  use  of  strategies  such  as  restarting  and  residual  replacement  to  limit  the  error  that 
accumulates  throughout  the  computation  (see,  e.g.,  [20,  24]).  The  solution  is  not  as  simple  as  using  the  true  residual 
in  every  iteration  (or  even  every  s  iterations).  In  addition  to  increasing  both  the  communication  and  computation 
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in  the  method,  replacing  the  recursively  computed  residual  with  the  true  residual  can  destroy  the  super-linear 
convergence  properties  exhibited  by  KSMs,  as  it  is  these  recurrences  which  drive  the  Lanczos  process  [20,  24]. 
Residual  replacement  strategies  must  then  carefully  select  iterations  where  residual  replacement  takes  place,  which 
requires  estimating  the  accrued  rounding  error.  Van  der  Vorst  and  Ye  have  successfully  implemented  such  a  strategy 
for  standard  Krylov  methods  [24]. 

The  computation  that  occurs  in  each  iteration  in  standard  Krylov  methods,  namely,  the  updates  of  xk  and  rk, 
consist  of  one  or  more  SpMV  and  vector  operations.  Because  there  are  dependencies  between  iterations  in  standard 
Krylov  methods  and  the  main  kernels  in  each  iteration  have  low  computation  to  communication  ratios,  standard 
Krylov  method  implementations  are  communication-bound  on  modern  computer  architectures. 

This  motivated  s— step,  or,  Communication- Avoiding  KSMs  (CA-KSMs),  which  are  equivalent  to  the  standard 
KSM  implementations  in  exact  arithmetic.  These  variants  use  blocking  strategies  to  perform  s  computation  steps 
of  the  algorithm  for  each  communication  step,  allowing  an  O(s)  reduction  in  total  communication  cost  (see,  e.g., 
[2,  3,  4,  6,  7,  10,  11,  16,  19,  21,  22]).  Despite  attractive  performance  benefits,  these  variants  are  often  considered 
impractical,  as  increased  error  in  finite  precision  can  negatively  affect  stability.  The  deviation  of  the  true  and 
computed  residual  observed  in  standard  KSMs  is  worse  for  CA-KSMs,  with  the  upper  bound  depending  on  s. 
Although  many  previous  authors  have  observed  this  behavior  in  CA-KSMs  (see,  e.g.  [3,  5,  4,  11,  19,  25]),  we  are 
the  first,  to  our  knowledge,  to  provide  a  quantitative  analysis  of  round-off  error  in  these  algorithms  which  limits  the 
maximum  attainable  accuracy.  Our  analysis,  which  follows  the  analysis  for  standard  KSMs  in  [24],  leads  directly  to 
an  implicit  residual  replacement  strategy  to  reduce  such  error. 

Our  numerical  experiments  suggest  that,  for  solving  Ax  =  b,  if  the  corresponding  standard  method  with  residual 
replacement  converges  such  that  the  norm  of  the  true  residual  is  0(e||A||  •  | |cc|  |),  and  all  ( s  +  l)-dimensional  Krylov 
bases  generated  in  our  CA-KSM  are  numerically  full  rank,  our  methods  will  also  converge  with  norm  of  the  true 
residual  equal  to  0(e||A||  •  ||x||)  when  our  residual  replacement  strategy  is  employed.  Furthermore,  we  note  that  if 
we  generate  the  Krylov  basis  using  properly  chosen  Newton  or  Chebyshev  polynomials,  the  norm  of  the  basis  grows 
slowly  with  s.  Therefore,  the  number  of  residual  replacement  steps  for  these  bases  will  generally  grow  slowly  with 
respect  to  the  total  number  of  iterations,  and  we  claim  that  stability  in  CA-KSMs  can  be  achieved  with  no  asxjmptotic 
increase  in  the  communication  cost  of  s  steps. 

1.1  Related  Work 

We  briefly  discuss  related  work  in  the  areas  of  s-step  and  CA-KSMs,  as  well  as  work  related  to  the  numerical  analysis 
of  standard  KSMs. 

1.1.1  s-step  Krylov  Subspace  Methods 

The  first  instance  of  an  s— step  method  in  the  literature  is  Van  Rosendale’s  conjugate  gradient  method  [19].  Van 
Rosendale’s  implementation  was  motivated  by  exposing  more  parallelism  using  the  PRAM  model.  Chronopoulous 
and  Gear  later  created  an  s— step  GMRES  method  with  the  goal  of  exposing  more  parallel  optimizations  [5].  Walker 
looked  into  s-step  bases  as  a  method  for  improving  stability  in  GMRES  by  replacing  the  modified  Gram-Schmidt 
orthogonalization  process  with  Householder  QR  [25].  All  these  authors  used  the  monomial  basis,  and  found  that 
convergence  often  could  not  be  guaranteed  for  s  >  5.  It  was  later  discovered  that  this  behavior  was  due  to  the  inherent 
instability  of  the  monomial  basis,  which  motivated  research  into  the  use  of  other  bases  for  the  Krylov  Subspace. 

Hindmarsh  and  Walker  used  a  scaled  (normalized)  monomial  basis  to  improve  convergence  [10],  but  only  saw 
minimal  improvement.  Joubert  and  Carey  implemented  a  scaled  and  shifted  Chebyshev  basis  which  provided  more 
accurate  results  [12].  Bai  et  al.  also  saw  improved  convergence  using  a  Newton  basis  [1].  Although  successively 
scaling  the  basis  vectors  serves  to  lower  the  condition  number  of  the  basis  matrix,  hopefully  yielding  convergence 
closer  to  that  of  the  standard  method,  this  computation  reintroduces  the  dependency  we  sought  to  remove,  hindering 
communication-avoidance.  Hoemmen  resolves  this  problem  using  a  novel  matrix  equilibration  and  balancing  approach 
as  a  preprocessing  step,  which  eliminates  the  need  for  scaled  basis  vectors  [11]. 

Hoemmen  et.  a{  [6,  11,  16]  have  derived  Communication- Avoiding  variants  of  Lanczos,  Arnold!,  CG  and  GMRES. 
The  derivation  of  Communication- Avoiding  variants  of  two-sided  Krylov  subspace  methods,  such  as  BICG,  CGS, 
and  BICGSTAB  can  be  found  in  [2]. 
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1.1.2  Error  Analysis  of  Krylov  Subspace  Methods 

An  upper  bound  on  the  maximum  attainable  accuracy  for  KSMs  was  provided  by  Greenbaum  [9].  Greenbaum  proved 
that  this  bound  can  be  given  a  priori  for  methods  like  CG,  but  can  not  be  predetermined  for  methods  like  BICG, 
which  can  have  large  intermediate  iterates.  Additionally,  Greenbaum  has  shown  the  backward  stability  of  the  CG 
algorithm,  by  showing  that  the  Ritz  values  found  lie  in  small  intervals  around  the  eigenvalues  of  A.  There  are  many 
other  analyses  of  the  behavior  of  various  KSMs  in  finite  precision  arithmetic  (see,  e.g.  [14,  15,  23]).  The  reader  is 
also  directed  to  the  bibliography  in  [17]. 

Sleijpen  and  Van  der  Vorst  implemented  a  technique  called  “flying  restarts”  to  decrease  the  amount  of  round-off 
error  that  occurs  in  KSMs  [20].  Their  method,  which  is  applicable  to  many  KSMs,  iteratively  tracks  an  upper 
bound  for  the  amount  of  round-off  that  has  occurred  in  the  iteration  so  far.  Using  this  upper  bound,  the  algorithm 
may  decide,  at  each  iteration,  to  perform  a  “group  update”,  to  restart  the  algorithm  (setting  the  right  hand  side 
appropriately),  or  both.  The  benefit  from  using  a  group  update  strategy  is  analogous  to  grouping  to  reduce  round¬ 
off  error  in  finite  precision  summation.  Following  this  work,  Van  der  Vorst  and  Ye  devised  a  residual  replacement 
strategy,  which,  rather  than  restarting,  replaces  the  residual  with  the  computed  value  of  the  true  residual,  combined 
with  group  updates  [24].  This  residual  replacement  occurs  at  iterations  chosen  such  that  two  objectives  are  met:  1) 
the  accumulated  round-off  does  not  grow  so  large  as  to  limit  the  attainable  accuracy,  and  2)  the  Lanczos  process  is 
not  perturbed  so  much  as  to  slow  the  rate  of  convergence.  To  determine  when  these  conditions  are  met,  the  algorithm 
iteratively  updates  a  bound  on  the  error  accrued  thus  far.  Our  analysis  closely  parallels  that  of  Van  der  Vorst  and 
Ye. 

1.2  Communication- Avoiding  Conjugate  Gradient 

We  briefly  review  the  Communication- Avoiding  Conjugate  Gradient  algorithm  (CA-CG),  given  in  Algorithm  1.  We 
chose  CG  for  simplicity,  although  the  same  general  technique  can  be  applied  to  other  KSMs  as  well.  In  the  interest  of 
space,  we  will  not  derive  the  algorithm  here,  but  instead  refer  the  reader  to  numerous  other  works  on  the  topic,  such 
as  [3,  5,  6,  11,  13,  19,  22].  The  CA-CG  method  has  both  an  inner  loop,  which  iterates  from  j  =  1  :  s,  and  k  outer 
loop  iterations,  where  k  depends  on  the  number  of  steps  until  convergence  (or  some  other  termination  condition). 
Therefore,  we  will  index  quantities  as  sk  +  j  for  clarity. 

In  CA-CG,  we  do  not  update  xsk+i ,  rsk+i ,  and  psk+i  directly  within  the  inner  loop,  but  instead  update  their 
coefficients  in  the  Krylov  basis 

Vk  =  [ Pk ,  Rk ]  =  [p0(A)psk,  Pl(A)psk,...,  ps(A)psk,  Po(A)rsk,  Pl(A)rsk, ...,  pa-i{A)rsk] 

where  p*  is  a  polynomial  of  degree  i.  We  assume  a  three-term  recurrence  for  generating  these  polynomials,  defined 
by  parameters  parameters  7j,  9i,  and  ap 

pi{z)  =7 i{A-  0il)pi-i(z)  +  <JiPi-2(z) 

This  Krylov  basis  is  generated  at  the  beginning  of  each  outer  loop,  using  the  current  r  and  p  vectors,  by  the 
communication-avoiding  matrix  powers  kernel,  denoted  Akx()  below.  We  then  represent  xsk+i ,  rsk+\  and  psk+i  by 
coefficients  esfc+-7',  csk+\  and  ask+R  which  are  vectors  of  length  0(2s  +  1),  such  that 

xsk+j  =  Vkesk+j  +  xsk 

rsk+j  _  ykcsk+j 
pSk+j  _  ykask+j 

The  matrix  powers  kernel  also  returns  the  tridiagonal  matrix  T,  of  dimension  (s  +  1)  x  s,  constructed  such  that 

APk  =  Pk+lTl+1 

A  TDk  _  JDk  rp 

AKi  —  Ki+1±i+ 1 

where  Pk,  Rk  are  n  x  i  matrices  containing  the  first  i  columns  of  Pk  or  Rk,  respectively,  and  T)+ 1  is  the  matrix 
containing  the  first  i  +  1  rows  and  first  i  columns  of  T.  Let  Vk  =  [Pk,  Rk_  1].  This  allows  us  to  write 
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AVf  -  A[Pj  ,  Rk_,j  =  \Pj+1Tj+i,  RjTj]  =  Vk+1 


T 


j+ 1 


T^- 


Let  Tj+1  = 


Lj+ 1 


Tj 


.  We  can  now  write  an  expression  for  Apsfc+-7  as  follows, 


Apsk+j 


AVkask+j 

{AVk)dsk+j 

Vk  T'  risk+i 

VkT'ask+j 


where  ask+- 7  denotes  the  nonzero  entries  of  asfc+J,  in  order  to  match  dimensions  with  Vk.  Therefore,  T'ask+ *  are  the 
Krylov  basis  coefficients  for  Apsk+P  This  expression  allows  us  to  avoid  explicit  multiplication  by  A  within  the  inner 
loop,  and  thus  allows  us  to  avoid  communication. 


Algorithm  1  CA-CG  Method 

x°,  r°  =  b  —  Ax°,  p°  =  r° 
k  =  0 

while  (not  converged) 

[Pk,  Rk,  T]  =  Akx(A ,  [psk,  [a  +  l,  8],[\po,...,p,]]) 

//where  pi  is  a  polynomial  of  degree  i 
Let  Vk  =  [Pk,  Rk],Gk  =  (Vk)TVk 
//Initialize  coefficient  vectors,  which 

//will  be  maintained  such  that  xsk+j  =  Vk  esk+:>  +  xsk ,  rsk+j  =  Vkcsk+j,  psk+]  =  Vkask+j 
ask  =  [1,  0  2sf,csk  =  [0s+i,  1,  Osf,  esk  =  [02s+i] 
for  j  =  1  :  s 

aafc+j-_i  =  ((csk+j-1)TGk(csk+j-1))/({ask+:’-1)TGk(T,ask+j-1)) 
e sk+j  =  e^+J"1  +  agfc+j_i  a8^'"1 

Csk+J  =  csk+j- i  _  v  \ask+j^ask+j- x) 

Psh+j- 1  =  {{csk+i)TGk{csk+i))/{{csk+i-l)TGk(csk+:’-1)) 

ask+0  =  csk+j  +  /3sk+j_iask+i- 1 

end  for 

xsk+s  =  pfc]eSfc+s  +  ,pfc 

__  J pk  j^k^sk+s 
psk+s  =  | pfe;  i?fc]asfc+s 

/c  =  k  +  1 

end  while 
return  xsfc 


2  Error  in  Finite  Precision  CA-KSMs 

We  assume  A  is  a  floating  point  matrix.  Throughout  this  analysis,  we  use  a  standard  model  of  floating  point 
arithmetic: 


fl(x  +  y)  =  x  +  y  +  6  with  |<5|  <  e(|x  +  y\) 

fl(Ax)  =  Ax  +  5  with  |<5|  <  em^|A|  •  \x\  +  0(e2) 

where  e  is  the  unit  round-off  of  the  machine,  x,  y  £  RN,  and  rriA  is  a  constant  associated  with  the  matrix-vector  mul¬ 
tiplication  (for  example,  the  maximal  number  of  nonzero  entries  in  a  row  of  A).  All  inequalities  are  componentwise. 
Using  this  model,  we  can  also  write 
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fl(y  +  Ax)  =  y  +  Ax  +  S  with  |<5|  <  e(| y  +  Ax\  +  toa|A|  •  |a;| )  +  0(e2) 

We  can  now  perform  an  analysis  of  round-off  error  in  computing  the  updates  in  s-step  methods. 

2.1  Error  in  Iterate  Updates 

In  the  communication-avoiding  variant  of  CG,  we  represent  and  symbolically  update  vectors  xsk+ J  =  Vkesk+ +  xsk 
and  rak+i  =  Vkcsk+ J  by  their  coefficients  in  the  basis  Vk  =  [ Pk ,  Rk],  where  Pk  and  Rk  are  the  O(s)  dimensional 
Krylov  basis  vectors  with  starting  vectors  psk  and  rsk,  respectively.  These  vectors  are  initialized  as 

esfe  =  [02s+1]t,  csk  =  [0S+1, 1, 0S]T 
In  the  inner  loop,  we  update  these  coefficients  as 

esk+j  =  esk+j-i  +  ask+jXask+^1  (2) 

csk+j  =  csk+j~l -T' (ask+3-iask+i-1)  (3) 

When  Equations  (2)  and  (3)  are  implemented  in  finite  precision,  they  become 

esk+j  =  Jl(c*k-  j  '  +  asu+j^a3^-1)  =  eek+i~l  +  ask+j_iask+^x  +  £sfc+J  (4) 

\£sk+j\  <  e\esk+j\  +  0(e2)  (5) 

csk+j  =  fl{csk+j-k  -  r  (asfc+J_1asfe+J~1))  =  csfe+J_1  -  r  (a.fc+j-r a’™-1)  +  r,sk+j  (6) 

\r)sk+j\  <  <\csk+J\  +  mT\r\ ■  \ask+j^ask+^\)  +  G(e2)  (7) 


Note  that  the  rounding  errors  in  computing  do  not  affect  the  numerical  deviation  of  the  true  and 

computed  residuals  [24].  Rather,  the  deviation  of  the  two  residuals  is  due  to  the  different  round-off  patterns  that 
come  from  different  treatment  of  asfc+J_iasfc+:,_1  in  the  recurrences  for  esfc+J  and  csk+R  Therefore,  we  let  the  term 
ctsk+j-ici sfc+l_1  denote  the  computed  quantity. 

To  avoid  confusion,  we  let  xsk+:>  =  ykgsk+j  _|_  anc[  fsk+j  _  yk^sk+j  denote  the  exact  values  of  expressions 
whose  constituents  (xsk ,Vk ,esk+i  and  csfe+J)  are  computed  in  finite  precision.  We  use  xsk+i ,  fsfc+J  to  denote  the 
floating  point  evaluations  of  the  same  expressions.  Assuming  x°  =  0, 


k- 1 


tk+j  =  fl(xsk+i)  =  fl{fl(Vk  •  esk+j)  +  £sk)  =  Vkesk+j  +  ^sk+J  +  ^(y‘esi+!  +  ^si+s) 


and 


k- 1 


t=0 

fc-1  \ 

sk+j  _j_  \’  n^si-\-s 


=  Vke3k+j  +  V  yiesi+s  +  ibsk+:>  + 

v  h  )  v  h 

\i/>sk+j\  <  e{\tk+j\  +  mv\Vk\  ■  \esk+i\)  +  <3(e2) 


(8) 


(9) 


fsk+j  =  fi^sk+jj  =  fl(ykcsk+i)  =  yk£sk+o  +  ^k+3 

\<t>sk+j\  <  e{\?sk+j\+mv\Vk\  ■  \csk+j\ 
We  can  then  write  an  expression  for  x3k+i  in  terms  of  xsk+A. 


(10) 

(ID 


X 


sk+j  _  yk -sk+j 


k- 1 


ykgsk+j  +  ^yiesi+s  +  V’s?;+s) 


isk+j  _ 


=  X 


i= 0 

£sk+j  -  ipsk+j 

sk+j  _L  ^psk+j 


(12) 
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Note  that  we  don’t  need  to  explicitly  compute  xsk+^  or  fsfe+J  within  an  inner  loop  iteration  in  order  to  update  the 
representation  of  the  current  solution,  esk+ J’,  and  residual,  csk+^ ,  in  the  Krylov  basis  in  the  next  inner  loop  iteration. 
Therefore  the  round-off  error  in  computing  xsk+i  and  fsk+J  is  not  cumulative  between  inner  iterations  -  the  only 
error  that  accumulates  is  the  error  in  updating  esfc+J  and  csk+A 

In  the  following  subsection,  we  analyze  round-off  error  that  occurs  in  finite  precision  CA-KSMs.  We  will  obtain 
an  upper  bound  for  the  norm  of  the  difference  between  the  true  and  computed  residual  at  step  sk  +  j. 

2.2  Deviation  of  the  True  and  Computed  Residual 

We  can  premultiply  Equation  4  by  AVk  to  write  an  expression  (in  exact  arithmetic)  for  the  value  of  Axsk+i , 

Axsk+j  =  A(Vkesk+j  +  isk)  =  AVkesk+j~x  +  ask+j-1AVkask+j~x  +  A£sk  +  AVk£sk+j 
and  we  can  premultiply  Equation  (6)  by  Vk  to  write  an  expression  (in  exact  arithmetic)  for  rsk+i\ 

rsk+j  =  ykdsk+j  =  yk^sk+j- 1  _  ask+j_1VkT'ask+j~x  +  Vknsk+j 

We  can  now  write  an  expression  for  the  difference  between  the  true  residual  and  the  computed  residual  using  our 
recurrences  for  Axsk+J  and  fsfc+K 


(13) 

(14) 


b  -  Axsk+J  -  rsk+j  =  b-  AVkesk+j  -  Aisk  -  Vkcsk+j 

=  (b-  AVkesk+J~x  -  A£sk  -  Vkcsk+i~x)  -  {ask+j-xAVkaak+i~x  -  ask+j^ VkT'ask+i~x) 

-  (. Avk£sk+j  +  vknsk+j) 

=  (b-Axsk+j~x  -  fsk+j~x)  -  (asfc+i_i AVkask+J~x  -  asfc+J-_it >kT,aak+j~1) 

-  (. Avk£sk+j  +  vknsk+j) 

j 

=  (b-  Aisk  -  ?sk)  -  J2(ask+t-iAVkask+i-x  -  ask+i~iVT' ask+i~x  +  AV^sk+i  +  Vr,sk+i) 

i= 1 

(15) 

Then  we  can  bound  the  2-norm  as: 


\\b-  Axsk+j  -fsk+j\\2  <  \\b- A£sk  -fsk\\2 

j 

+  J2  ( 1 1  iAVkask+i~ 1  -  ask+i-iVkT'  ask+i~x\\2  +  \\AVk£sk+%  +  \\Vkifk+i\\ 2)  (16) 

i= 1 

The  first  term  on  the  right  hand  side,  ||fc  —  Axsk  —  fsfc||2,  gives  the  norm  of  the  accumulated  error  at  the  start 
of  this  outer  loop  iteration.  The  remaining  terms  on  the  right  hand  side  denote  the  error,  or  the  deviation  of  the 
computed  from  the  true  residual,  accrued  in  each  inner  iteration  due  to  finite  precision  coefficient  updates.  In  order 
to  determine  when  the  true  and  computed  residual  have  deviated  too  far,  we  need  to  keep  track  of  an  estimate  of 
these  quantities,  and  do  it  in  a  communication-avoiding  way.  We  will  first  address  the  summation  term,  or,  the  error 
in  the  coefficient  updates. 

2.2.1  Error  in  Coefficient  Updates  within  Inner  Loop 

We  will  go  through  and  bound  each  term  in  the  summation:  (l)ash+i-iAVkask+t~x—ask+i-iVkT'ask+%~x,  (2 )AVk£tsk+\ 
and  (A)Vkifk+l .  Throughout  this  analysis,  we  will  tend  to  favor  the  2-norm.  Although  the  analysis  could  be  done 
using  any  p—  norm,  the  2-norm  quantities  are  easily  computable  in  a  communication-avoiding  fashion,  since  the 
0(2s  +  1  x  2s  +  1)  Gram  matrix  encodes  the  dot-products  with  the  basis  vectors.  For  the  remainder  of  this  paper 
we  will  drop  terms  higher  than  0(e)  for  simplification. 
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Theorem  2.1.  Let 


1  —  1?  -^2—2]  —  ['Vp,ski'“i  ^p,  sk-\-i— 15  ski  ••••)  fir,  sk-\-i— 2] 

Vf  =  [J*  i?^]  =  [A-l,  «Pl-fc+i,  ^-2,  Vr,sk+i- 1] 

be  matrices  of2i—l  and  2z+l  6aszs  vectors,  respectively,  for  a  Krylov  subspace  with  A.  Then  \\o.sk+i-iAVk_^askJr‘l~l  - 
asfc+j_iVrifeT/asfc+*-1||2  zs  bounded  above  by 

| |aafc+i_1AV£.1a»fc+i-1  -  a,Hi.1^fa<w-1||2 


where 


<  \\AVf_i  -  Vf  T'||2  •  1 1  avfc-H- 1 asfe+i_  1 1 1 2 

p,  sfc+1  /* p ,  sfc+2  Ar,  sfc+1 

I  S  II  1 1  S 


<  \/N  ■ 


max 


,CP’ 


7i 


|2 >  •••? 


7* 


|2> 


71 


|2?  •••? 


'-r,  s/c+i— 1 


7* 


|i«fc+i_i«fc+i-i||2\ 


,CP' 


,  s/c+£ 


7t 

/•r,  s/c+£ 

I  >  I 


12  ^ 


e 


■  | |#p,sk+t 1 12  +  (2|#t|  +  2m.j4 1 1 -A| I2)  •  ||wP) sfe+t-ilb  H — pp-p  ■  !!%>, sfe+* — 2 1 12^  1  <  t  <  i 
2  <  e  •  ||'Wr,sfc+t||2  +  (2|0t|  +  2mj4||^4||2)  •  ||'0r,sfc+t-l||2  H - |~y  •  ||Vr,sfe+t-2||2^  1 


7t 

Proof.  We  will  bound  ||asfc+i_i  AVk_iask^l 

—  asfc+,;_i VikT'ask+l  1 1 1 2,  computed  exactly.  We  can  rewrite  this  as 

IKfc+j-rAV^a8*-*-1  -  a8fc+i-iV;fcr,aifc+i-1||2  <  -  V*T'\\2  •  ||a8fc+i_1o8fc+<-1||2 

First  we  will  bound  ||asfc+i_iasfc+I_1||2.  Using  equation  (4),  we  can  write 


®-sk-\-i— 1& 


s/c+i— 1  _  _  £sk-\-i—  1  _ 


=  e  1  —  e 


|asfc+j_iasfe+i_1|  <  |esfc+*  -  esfc+,-1|  +  |«ffe+,;| 


|a8fc+i_1o8fc+i-1||2  <  ||esfc+4  -  esfc+*-1||2  +  e||e8fe+i||2 

rkn-if 


(17) 


Now,  the  term  left  to  bound  is  ||AT^_1  —  V^T' 1 12-  We  know  that,  computed  in  finite  precision, 

=  \Pi—li  Vp,sk+ii  P-i  —  2)  W,  sfe+i— l] 

where  vPtSk+i  is  a  basis  vector  for  the  Krylov  subspace  with  starting  vector  psk,  defined  by  the  formula  (similarly  for 

W,  sk+i—  1 ) : 


Vp,  sk+i  —  li(A  @iI')Vp'Sk+i~\-\-(JiVpt)Sk+i—2  (IS) 

—  TiAip,  sk+i—  1  Tidi.Vp,  sk+i  —  1  “t”  ^i^p,  sk+i— 2 

Parameters  7,,  9i,  and  eq  are  coefficients  defining  the  three-term  polynomial  basis  for  the  Krylov  subspace.  In  the 
monomial  basis,  6i  and  cq  are  always  0,  and  7 =  1.  For  Newton,  7,  =  1,  cq  =  0,  and  9i  are  chosen  to  be  eigenvalue 
estimates  (Ritz  values),  ordered  according  to  the  (modified)  Leja  ordering.  For  Chebyshev,  these  parameters  are 
chosen  based  on  the  bounding  ellipse  for  the  estimated  eigenvalues  of  A. 

When  Equation  (18)  is  implemented  in  finite  precision,  we  get  (See  Appendix  A),  similarly  for  tv, sfc+i-i: 
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Vp,  sk+i  fl(Vp,sk+i)  Yi(^l  sk+i—  1  +  &iVp,  sk+i— 2  “t”  C 

=  ^iAVp^  sfc+i—  1  ”fi@iVp,  sk+i—  1  +  &iVp,  sk+i— 2  +  C  (19) 

|CP,sfe+*l  <  e(|«p,sfc+i|  +2(17*0*1  +mA  ■  |7i|  '  H)  •  |«p,«fc+*-i|  +  2| o-jfip, sfc+i-2|)  (20) 

Now,  rearranging  Equation  (19),  we  get  an  expression  for  At)^  sfc+*_i  (or,  similarly,  Avr>  sfc_|_*_2): 

^d+p,  sfc+z— 1  =  [^p,  sfc+i  +  'Yi@i’Vp,  sk+i—  1  ^i^p,  sk+i  — 2  '  ] 

Yi 

1  .  ..  .  a*.  cp’sk+t 

—  Vp,  sk-\-i  ~i  ^i^p,  sk+i—  1  Vp,  sk-\-i— 2 

7»  7i  7* 

Notice  that  the  right  hand  side  is  a  multiplication  of  the  finite  precision  basis  vectors  by  T,  our  tridiagonal  matrix 
with  change  of  basis  coefficients,  plus  the  error  term,  — .  To  write  AVji l1,  we  can  write  the  above  as  a  matrix 
equation: 


dl  [+p,  ski  •■*)  Vp,  sk+i— 1)  ^r^skt  •••■>  W,  sk+i— 2] 


—  fop,  ski  *••>  Vp,sk+ii  ^r^skt  *•*)  W,  sfc+i— l] 

£p,  s/c+1  ^p,  sk+i  £r,  s/c+1  £r,  s/c+z— 1 


, 

7i  7*  7i  7* 


AEfj  =  V?T  - 


^p,  sk+ 1  £p,  s/c+i  £r,  s/c+1  £r,  s/c+i— 1 


, 

7i  7*  7i  7i 


We  can  rearrange  the  above  equation  to  get 


/*p,  sfc+1  Ap,  sfc+i  /-r,  sfc+1  /-r,  sk+i—  1 

k  \rkrj-\r  r  S  S 


-  vyr  =  - 


, 

Ti  7 %  7i  7* 


Taking  the  norm  of  both  sides,  we  see  that 


II^m  -  VfT'h  < 


rCP 


,  sk+ 1 


£p,  sfe+i  £r,  sfc+1 


C 


r,  sfc+i— 1 


, 

7i  7i 


<^ll[ 


Cp 


,  sfc+1 


, 

7i  7 i 


,  /r,  *fc+i 

=  VN  ■  max  (  ||- - ||i, 


7i  7 * 

£p,  s/c+i  £r,  s/c+1  £r,  s/c+z—  1 

i  J 

7i  7i 

/*p,  s/c+i  Ar,  s/c+1 

I  >  II  IP  I 


<  Vn  ■  max 

Now,  we  can  write  the  whole  bound  as 


7i 

+p,  sk+ 1 

I- - 1 

Yi 


\2,  ■■ 


7 i  Yi 

fp,  sk+i  /-r,  s/c+1 

I  >  II  11^  I 


Ar,  sk+i—  1 

I1’-’  H“  +.  II1 

Ar,  sk+i—  1 
I  >  II 


7 i 


Yi 


|2>  •••) 


7* 


asfc+i_1AlV;fc_1asfe+i~1  -  ask+i-iV^T' a 


krj-i/  sk-\-i— 1 1 


sk+i—  1 1 


<||+Vf_1-U/TT'||2.|Kfc+J_1a' 

/  /-p,  sk+ 1  +p,  s/c+i  /-r,  sfc+1  +r,  s/c+i— 1 

<  n/JV  •  max  (  1 1 - - 1|2 . ||- - ||2,|P - II2,-,  ||^- - ||2 


7i  Yi 

%sk+i_£sk+i- 1||2)  +0(e2) 


7i 


7* 


Using  Equation  (20),  we  get 
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,cp' 


,  sk-\-t 


it 


12  ^ 


<  e 


7 7  •  |  |^p,  sk+t\ I2  +  (2|0t|  +  2777.^4  1 1>1|  I2)  •  11%, ,  sfc+*— 1 1 12  +  “T — f  •  1 1  %j,  sk+t  —  2  \  I  2 

W  It*  I 


'-r,  sk-\-t 


it 


<  e 


1 - T  •  ||W  sfc+ilk  +  (2|^t|  +  2?77.^4 1 1  yl  1 12)  ’  |  \Vr,  sfc+t-1 1 1 2  4 - f — f  ‘  |  |W,  sfe+7  —  2 1 12 

lit  |  N 


i  —  1 


This  proves  the  Theorem.  □ 

We  have  two  terms  left  to  bound  in  Equation  (16),  AVk£sk+l  and  ykTjsk+t_  We  can  bound  the  2-norm  of  AVk£sk+l 


\\AVkek+%<\\A\\2-\\\Vk\-\Ck+t\\\-2 

<e\\A\\2.\\\Vk\.\esk+i\\\2  (21) 

Now,  to  bound  the  2-norm  of  ykrjsk+i^  we  plug  in  and  use  Equation  (7): 

\\vkvsk+%<\\\vk\-\vsk+i\\\2 

<  e||  \Vk\  •  (|csfc+J|  +  mT\T'\  ■  |asfc+l_1asfc+*-1|)||2 

<  e|| \Vk\ •  {\csk+l\+mT\T'\  ■  \esk+i  -  e8fc+i-1|)||2 

<  e( 1 1  \  Vk\  •  |csfc+t| \\2  +  mT\\T'\\2  •  ||  \Vk\  ■  \esk+i  -  e8**"1!  ||2)  (22) 

Putting  all  our  terms  together,  we  find 

\\b-Axsk+j-fsk+j\\2 

<  \\b-  A£sk  -fsk\\2 

+  J2  ^\ask+i.1AVkask+i-1  -  ask+i-iVkTr ask+i~1\\2  +  \\AVkCk+%  +  ||t>Vfc+i|2’ 

i—  1 

<  ||6- Afc8fc-f8*||2  (23) 

3  /  /-p,  sk-\-l  /-p,  sk-\-i  /-r,  sfc+1  sr,  sk-\-i—  1  \ 

H-^lv^-max  ||^ - 1|2,..„  ||i - 1|2,||^ - 1|2,..„  ||^ - 1|2  •(||e8fc+i-e8fc+<-1||2) 

V  7i  7 i  7i  7 i  ) 

+  e||T||2  •  ||  |Efc|  •  |esfe+!|  ||2  +  e(||  |Efc|  •  |csfe+*|  ||2  +  mT\\T'\\2  ■  ||  |Efe|  •  |esfc+i  -  e^"1!  ||2)] 

2.2.2  Error  in  Basis  Change  in  Outer  Loop 

Now,  we  want  to  bound  the  term  ||6  —  Axsk  —  fsfe||2  in  Equation  (23).  We  can  write,  again  assuming  a;0  =  0,  r°  =  b, 
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\\b -  Aisk  -  fsk\\2 
=  | \b  -  A(xsk  +  V’sfc)  -  fsk  -  <j>sk 

<  ||6  -  Axak  -  fsfc||2  +  || Aipsk  +  <t>sk ||2 

=  ||6-  Axs{k~1)+s  -  fs(fe"1)+s||2  +  || Aips{k~1)+s  +  (j)s(k-V+s 

<  ||6  -  At(k~l)  -  fs(fc-1)||2  +  \\Aijjs(k-1)+s  +  <^(fe-1)+s||2 


i  \  ^  ri  I  x-.  A  1  l)+2—  1  T/fc—  lrpf  s(k—  l)+i—  1 1 1 

+  2_^\-WaAk-i)+i-iAV  a  -  as(fc_i)+j_il/  ||2 


2—  1 


+  1 1  J4Vrfc-1^®(*-1>+<|  |2  +  ||y*-y(k-i)-K||2] 


fc-1 


<  ^  (  ||Ai/)si+s +  0si+s||2 +  JZ  [l|asZ+*-i^VriasJ+i_1  -  asi+i-1VlT'asl+i~1\\2  +  \\AVl£sl+%  +  ||VV+i||2 


1=0 
k- 1 


<^(||^+s||2  +  ||^+s||2) 


1=0 
k—  1  s 


+EE  (llarf+i-iAV-'o*'^-1  -  asi+i_1T>iT,asi+i-1||2  +  ||AVr,^,+i||2  +  ||VV+i||2) 


J=0  2—1 
k-1 


<E  01412 


isZ+si 


1=0 
k—1  s 


J2  E  (ll^+i-i^«s'+i_1  -  asl+.i-1VlT'asl+'l-1\\2  +  ||At>'^+i||2  +  ||VV+i|2 


1=0  2—1 


(24) 


This  bound,  in  words,  says  that  the  error  at  the  start  of  the  kth  outer  loop  iteration  is  the  sum  of  (1)  the  errors 
in  performing  coefficient  updates  in  every  inner  loop  iteration  executed  so  far  (iterations  1  through  sk)  and  (2)  the 
errors  in  computing  x  and  f  in  every  outer  loop  iteration  so  far  (outer  loop  iterations  1  through  k ). 


2.2.3  Total  Error 

Putting  the  terms  in  the  above  two  sections  together,  we  obtain  an  upper  bound  for  the  error  accumulated  at  iteration 
sk  +  j  : 
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\\b-  Axsk+j  -rsk+j\\2 
<  \\(b  -  Aisk  -  fsk)\\2 


+  (\\ask+i-1AVkask+l-1  -  oifc+j_1VfcT,o«fc+<-1'||2  +  \\AVkZsk+%  +  \\Vkvsk+%) 

i= 1 
k _ l 

<£(||^-I+»||2  +  ||^,+«||2) 


1=0 
k—1  s 


+  E  E  (l|a.i+i-i^Vr,oa,+i-1  -  a8j+i_1Vr,lV,+i-1||2  +  \\AVl£sl+%  +  ||VV+i||2) 


1=0  i—1 
3 


+  E  ( \  \ask+i-iAVkask+'1-1  -  +  WAV*?**' ||2  +  H^V^Ib 

i= 1 
fc-1 


=  E(llAll2-ll^,+sll2  +  ll^,+sll2) 


Z=0 

fc-1 


EE^- 


max 


Z=0  i—l 


/•p,  s/+l  /*p,  sZ+i  /-r,  sZ+1 

I  ^  II  1 1  ^  II  1 1  S 

1 2 5  •••,  || - 1 1 2 5 


,C 


r,  1 


7i 


7i 


-  e||A||2  •  II  |y;|  •  |esi+i|  ||2  +  e(| | \Vl\ ■  \csl+l\  ||2  +  mT\\T'\\2  -\\\Vl\-  \esl+i  -  esl+i~\  „2 


Elv^' 


max 


sk+ 1 


7i 

/-fc|  |~s/c+i 


I  2  5  •••5 


,CP 


sfc+i  Ar)  sfc+1 


7 * 


12,  | 


7i 


|2?  •••? 


,C 


r,  s/c+i— 1 


7* 


|2  I  '  (||esfe+i  —  esfc+i_1||2) 


+  e||A||2  •  ||  |Ffe|  •  |esfc+‘|  ||2  +  6(||  |Ffc|  •  |<*fc+*|  ||2  +  mT||T'||2 


k  |  _  I  gsk+i  _  gsk+i—l 


E  ^sfc+j 


(25) 


(26) 


where  we  will  use  dsfc+j  to  denote  an  upper  bound  for  ||6  —  Axsk+:>  —  fsk+J :||2.  By  the  equation  above,  we  can  keep 
track  of  this  quantity  iteratively,  by  updating  this  quantity  in  each  iteration  as  follows: 

Kl<j<s-1: 


dsk+j  =  dsk+j-i  +  \ /N  ■  max 

+  e||A||2.  ||  1^1-1^11 


Cp 


,  sk-\- 1 


|2j  •••> 


C V >  sk+j 

Is  II  IIs 

2) 


r,  sk-\- 1 


12?  ■ 


,C 


r,  sk+j  —  l 


7i  7j  7i  7? 

2  +  e(||  |Vfc|  •  |csfc+J|  ||2  +  mT\\T'\\2  •  ||  |V-|  •  \esk+i  -  esk+^\ ||2) 


\esk+j  -  esk+j-%) 


(27) 


If  j  =  s: 
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max  [  M - 112, 

7i 


dS(k+ 1) 

—  dsk-\-s  —  dsk-\-s—  1 

_ _  /  £P,sk+ 1 

+  vN  •  max  1 1 

2*111 

+  m\2-\\rk+s 

=  ^s/c+s—  1 

max 


,cp' 


,  sk-\-s  /*r,  sfc+1 

II  1 1 


,C 


r,  sk-\-s—  1 


- M2,  II - 1 12,  •••, 

7s  7i 


.(||g»*+»-g-fc+^1||2) 


+  e||A||2  •  ||  \  Vk\  ■  |esfc+s|  ||2  +  e(|| \Vk\ •  |csfc+s|  ||2  +  mT||T'||2  •  ||  |Ffc|  •  \esk+s  -  ||2) 


|2  +  ||^+s||2) 


/•p,sk-\- 1  /*p,  sk-\-s  /*r,  sfc+1  /-r,  s/c+s—  1 

v^V-  max  (  ||— - 1|3 . |M - l|2,ll- - lb,-,  ||- - 1|2  )  •  (||esfe+s  -  e“fc+'-1|i2i 


7i  7s  7i  7s 

+  eim|2  *  ((1  +  mv)  *  II  |Cfc|  *  |esfe+s|  ||2  +  ||£sfe+s||2) 

+  e((l  +  mv)  •  ||  |Ffc|  •  |csfe+s|  ||2  +  |rfc+s||2  +mT||T/||2  •  ||  |Ffc|  •  \esk+s  -  e^8"1!  ||2) 


(28) 


2.3  Avoiding  Communication  in  Computing  the  Upper  Bound 

In  each  iteration,  we  will  update  dsfc+J,  the  deviation  of  the  true  and  computed  residual,  given  by  Equations  (27) 
and  (28).  Section  3  will  discuss  how  this  quantity  is  used  to  determine  whether  or  not  residual  replacement  occurs 
at  a  given  iteration.  First,  however,  we  show  how  to  compute  the  value  of  dsk+i  in  a  communication  avoiding  way, 
to  avoid  reintroducing  the  communication  bottlenecks  that  we  sought  to  remove. 

We  can  assume  that  we  have  estimates  for  ||7L||2  and  | |T'| |2. These  quantities  need  be  computed  only  once,  since 
their  values  do  not  change  throughout  the  iteration.  The  remaining  quantities  we  must  compute  are 


1 .  max  (||^ 


sk+j 


sk+j  —  1 


7 j 


7 j 


2.  || \Vk\  ■  |esfe+J|  ||2  ,  || \Vk\  ■  \csk+j\  ||2  and  ||  |C|  •  \esk+i  -  esfc+^|  ||2, 

3.  ||lsfc+J'||2  and  |bsfe+J'||2,  and 

4.  ||esfc+i  -  esk+j-1\\2 


We  can  compute 


l£! 


i  if 


i  c: 


i  if 


1 2  ?  at  the  beginning  of  the  outer  loop,  and 


7l  ,11  7j.  11*1  II  7l  11*7  ,11  r)/j 

then  easily  choose  the  maximum  amongst  the  appropriate  2 j  —  1  scalar  quantities  within  each  inner  loop  iteration. 
Computing  this  set  involves  computing  the  norm  of  each  basis  vector,  which  can  easily  be  accomplished  by  use  of 
the  Gram  matrix  for  no  additional  communication  cost.  The  additional  computation  cost  is  0(s2)  per  s  steps. 

For  terms  in  2.  above,  we  have  2  choices.  The  first  option  is  to  upper  bound  each  quantity  as  ||  \Vk 


$  sk+j  I 


2  < 


|gsfc+j||2_  This  requires  no  additional  communication,  as  each  processor  can  compute  ||Ffc||2  using  the  Gram 


11^ 

matrix,  and  computing  ||esfe+-,||2,  ||csfe+J||2,  and  ||esfc+:'  —  esfe+J  1||2  are  all  local  work.  The  additional  computation 
is  0(s2)  per  outer  loop  iteration.  The  second  option  is  to  compute  G|y|  =  |FT|  •  |F|  in  each  outer  loop.  Then  G|y| 
can  be  used  to  compute  the  two-norm  of  these  quantities  without  communication  within  the  inner  loop.  Although 
this  provides  a  tighter  bound,  this  option  requires  an  additional  global  reduction  in  each  outer  loop  iteration  (the 
asymptotic  communication  costs  remain  the  same).  This  option  also  requires  0{s2n )  additional  computation  per 
outer  loop  iteration. 

We  also  use  the  Gram  matrix  to  compute  ||^sfc+:,||2  <  \J  (esk+j)T  G(esk+:>)+\ \xsk\ |2  and  ||rsfe+J||2  <  y '  (csfe+t)TG(csfe+J) 
,  where  ||a:sfe||2  and  ||rsfc||2can  be  computed  at  the  start  of  the  inner  loop  (since  we  must  communicate  to  compute 
and  distribute  the  values  of  xsk  and  rsk  anyway).  This  requires  0(n  +  s 2)  operations  per  outer  loop. 

Finally,  \\esk+i  —  esfc+J^1||2  can  be  computed  locally  by  each  processor,  so  no  additional  communication  is  required. 
Computing  this  quantity  in  each  inner  loop  requires  an  additional  0(s2)  operations  per  outer  loop. 

We  note  that  using  the  Gram  matrix  to  compute  inner  products  locally  does  result  in  an  extra  O(e)  error  term. 
However,  since  all  quantities  in  1-3.  above  are  multiplied  by  e  in  the  bound  for  dsk+j ,  these  O(e)  error  terms  become 
0(e2)  error  terms,  and  can  thus  be  ignored  in  our  bound. 
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The  communication  and  computation  costs  given  for  computing  dsk+ J  above  do  not  affect  the  overall  computation 
or  communication  cost  of  the  algorithm.  Therefore,  we  can  keep  track  of  the  quantity  dsk+i  in  each  iteration  without 
asymptotically  increasing  the  cost  of  communication  or  computation  in  CA-KSMs. 

3  Replacement  of  Residuals  in  CA-KSMs 

In  order  to  improve  the  maximum  attainable  accuracy,  we  want  to  replace  the  computed  residual  with  the  true 
residual  at  certain  iterations,  according  to  our  calculated  dsk+j  value.  We  must  choose  these  iterations  carefully  to 
satisfy  two  constraints:  1)  we  don’t  want  to  let  the  deviation  grow  too  large,  and  2)  we  don’t  want  to  lose  super-linear 
convergence  provided  by  the  underlying  Lanczos  process. 

Van  der  Vorst  and  Ye  [24]  have  suggested  the  following  condition  for  residual  replacement  to  satisfy  these  prop¬ 
erties: 


dsk+j- 1  <  e|| rsk+J  &&  dsk+j  >  e||fsfc+J||2  &&  dsk+j  >  1.1  dinit  (29) 

where  we  initially  set  do  =  dinu  =  e(||r°||2  +  m+\ |A| |2  •  ||a;0||2)  =  <?| |6| I2,  and  e  is  a  tolerance  parameter.  The  value 
e  =  i/e  has  been  found  to  be  a  good  value  empirically  for  standard  KSMs  [24],  and  we  have  observed  good  results 
for  CA-KSMs  as  well. 

In  our  case,  we  do  not  know  the  actual  value  of  ||rsfc+-,”1||2,  but  we  can  use  the  Gram  matrix  Gk  to  compute 
||pfc+j-i||2_  will  now  argue  that  we  can  replace  ||rsfc+-J”1||2  and  ||fsfe+J||2  with  ||fsfe+-7_1||2  and  ||fsA’+J||2  in 
Equation  (29).  Since 

fsk+j-l  __  fsk+j- 1  _  isk+j- 1 


we  can  say 


-sfc+j-l| 


2  <  e\\rk+j^\\2  +  e\\(t>sk+j-^ 


\2  +  e\\<P  "  1 1 2 

We  know  ||</>sfe+-,-1||2  =  0(e)  and  e  =  e1/2.  Since  ignore  powers  of  e  larger  than  0(e),  this  becomes: 


-||fak+j-1||2  <  e||Pfc+J-1||2  +  0(e3/2) 

<  e||Pfc+J-1||2 

Therefore,  we  can  use  ||pfc+i-i||2  in  our  replacement  condition  for  CA-KSMs.  An  analogous  argument  holds  for 
| \rsk+i 1 12.  Our  condition  for  residual  replacement  in  CA-KSMs  will  then  be 

dsk+j- 1  <  e||fsfe+J-1||2  &&  dsk+j  >  e| 1 12  &&  dsk+j  >  1.1  dinit  (30) 

If  this  statement  is  true,  we  perform  a  group  update  by  accumulating  the  current  value  of  xsk+j  into  vector  z,  as 
z  =  fl(z  +  xak+i),  and  we  set  fsfe+J  =  fl(b  —  Az). 

To  perform  a  residual  replacement  in  CA-KSMs,  all  processors  must  communicate  their  elements  of  xsk+i  to 
compute  b  —  Axsk+i ,  and  we  must  break  out  of  the  inner  loop  (potentially  before  completing  s  steps)  and  continue 
with  computing  the  next  matrix  powers  kernel  with  the  new  residual  in  the  next  outer  loop.  This  means  our 
communication  costs  could  potentially  increase  if  the  number  of  replacements  is  high  (i.e.,  we  compute  the  true 
residual  every  iteration),  but  our  experimental  results  in  the  next  section  indicate  that,  as  long  as  the  generated 
basis  is  numerically  full-rank  and  the  basis  norm  growth  rate  is  not  too  high,  the  number  of  replacements  is  low 
compared  to  the  total  number  of  iterations.  Therefore  the  communication  cost  per  s— steps  does  not  asymptotically 
increase.  A  formal  round-off  analysis  of  the  residual  replacement  scheme  using  this  condition  for  KSMs  can  be  found 
in  [24].  Our  future  work  will  include  a  round-off  analysis  of  this  replacement  scheme  for  CA-KSMs.  The  algorithm 
for  residual  replacement  can  be  found  below  in  Algorithm  2. 
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Algorithm  2  Residual  Replacement 

if  dsk+j- 1  <  e| — 1 1 12  &&  dsk+j  >  e| \rsk+J ||2  &&  rfsfe+j  >  1.1  dinit 
z  =  z  +  £sk+j 
rsk+j  =b-  Az 
xsk+j  =  0 

dinit  —  dk  =  e(||rsfc+J||2  +  mA\\A\\2  ■  ||2||2) 

reset=  1 
BREAK 

end  if _ 

We  can  now  give  the  algorithm  for  CA-CG  with  residual  replacement,  shown  in  Algorithm  3.  Blue  text  denotes 
new  code  added  to  Algorithm  1  for  the  purpose  of  residual  replacement. 
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Algorithm  3  CA-CG  with  Residual  Replacement 
Compute  ||A||2 
x°  =  0 ,  r°  =  6,  p°  =  r° 
k  =  0 

do  =  d^it  =  e(||r°||2) 
z  =  0 

reset=  0 

while  (not  converged) 

[P\  Rk,  T]  =  Akx(A ,  \psk ,  rs%  [s  +  1,  a],  [[Po,  ...,ps]]) 

//where  pi  is  a  polynomial  of  degree  i 
if(k==0)  Compute  ||T'||2 
Let  Vk  =  [ Pk ,  Rk],Gk  =  ( Vk)TVk 

//Initialize  coefficient  vectors,  which 

//will  be  maintained  such  that  xsk+j  =  Vkesk+j  +  xsk ,  rsk+j  =  Vkcsk+:> ,  psk+j  =  Vkask+j 
ask  =  [1,  0  2sf,csk  =  [0s+i,  1,  0S]T,  esk  =  [02s+i] 
for  j  =  1  :  s 

ask+j =  ((csk+j-1)TGk(csk+j-1))/((ask+j-1)TGk(T'ask+j-1)) 

esk+j  =  esk+j- i  +  ask+j_iask+j- 1 
Csk+J  =  csk+j- i  _  ask+j^T'a^+i-1 

Psk+j- 1  =  {(csk+:i)TGkicsk+j))/{{csk+j-1)TGk(csk+j-1)) 

ask+J  =  csk+j  +  pak+j_ia'k+j- 1 

Update  dsk+j  using  Eq.  (27) 

if  dsk+j-i  <  g| |rsfc-t-j— 1 1 1 2  &&  dsk+.  >  g||rsfc+j ||2  &&  dsk+_  >  i.idinit 
z  =  z  +  xsk+j 
rsk+j  —i_az 
x°k+i  =  o 

dinit  =  4  =  e(lksfe+Jl|2  +  mA\\A\\2  •  ||z||2) 

reset=  1 
BREAK 
end  if 
end  for 
if  reset!  =  1 

Update  dsk+s  by  Eq.  (28) 

xsk+s  =  jpfc)  pfe]e sk+s  +xsk 
^sk+s  _  Jpfc  pfej^sfe+s 

end  if 
reset=0 

psk+s  =  [pfc)  pfc]asfc+s 

/c  =  fe  +  1 

end  while 
return  2  +  xsk 


4  Experimental  Results 

We  evaluated  our  residual  replacement  strategy  on  a  few  small  matrices  (both  symmetric  and  unsymmetric)  from 
the  University  of  Florida  Sparse  Matrix  Collection,  using  the  CA-BICG  method  (or  CA-CG  where  appropriate).  In 
these  experiments,  we  compare  standard  (BI)CG  with  both  our  CA-(BI)CG  method  and  our  CA-(BI)CG  method 
with  residual  replacement.  We  ran  these  tests  for  s  =  [4,  8,  16].  To  lower  the  2-norm  of  the  matrix,  we  used  row 
and  column  scaling  of  the  input  matrix  A  as  a  preprocessing  equilibration  routine,  as  described  in  [11].  This  process, 
which  only  need  be  performed  once,  is  used  in  lieu  of  scaling  the  basis  vectors  after  they  are  generated,  which 
reintroduces  communication  dependencies  between  iterations.  For  each  matrix,  we  selected  a  right  hand  side  b  such 
that  ||a;||2  =  1,  Xi  =  l/i Jn.  We  have  found  empirically  that  using  a  replacement  tolerance  around  e  =  y/e  ,  the  same 
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value  used  in  Van  der  Vorst  and  Ye  [24],  ensures  that  the  true  and  computed  residual  remain  the  same  throughout 
the  computation. 

The  figures  below  are  organized  as  follows.  The  left  column  shows  the  convergence  of 

•  standard  (BI)CG  (black  line) 

•  standard  (BI)CG  with  residual  replacement  (black  dots)  [24] 

•  CA-(BI)CG  for  all  three  bases: 

—  Monomial  (blue  line),  Newton  (green  line),  Chebyshev  (red  line) 

•  CA-(BI)CG  with  residual  replacement  for  all  3  bases: 

—  Monomial  (blue  dots),  Newton  (green  dots),  Chebyshev  (red  dots). 

The  right  column  shows  our  upper  bound  estimates,  dsk+j,  for 

•  standard  (BI)CG  with  residual  replacement  (black  dashed  line)  [24] 

•  CA-(BI)CG  with  residual  replacement  for  all  3  bases: 

—  Monomial  (blue  dashed  line),  Newton  (green  dashed  line),  and  Chebyshev  (red  dashed  line) 

vs.  the  true  value  of  || —  rsk+J  1 1 2  for 

•  standard  (BI)CG  with  residual  replacement  (black  dots)  [24] 

•  CA-(BI)CG  with  residual  replacement  for  all  3  bases: 

—  Monomial  (blue  dots),  Newton  (green  dots),  and  Chebyshev  (red  dots) 

Table  1  shows  the  total  number  of  replacements,  and  the  iterations  at  which  the  replacements  occurred,  for  each 
experiment. 
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Table  1:  Residual  Replacement  Iterations  for  CA-BICG  Experiments.  Matrices  from  the  University  of  Florida  Sparse  Matrix  Collection.  Equilibration 
routine  used  on  A  to  reduce  norm.  Right  hand  side  chosen  such  that  ||rc||  =  1.  Tolerance  used  was  e  =  \Je.  CA-KSM  restarts  given  for  monomial 
basis  (M),  Newton  basis  (N),  and  Chebyshev  basis  (C).  A  indicates  that  a  Krylov  basis  generated  by  the  algorithm  was  not  full  rank. 
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Figure  1:  cddel.  Model  convection-diffusion  differential  equations.  Non-normal. 
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Figure  2:  jpwh991.  Unsymmetric  matrix  from  Philips,  LTD.  Semiconductor  device  problem.  Non-normal. 
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Figure  3:  meshleml.  Structural  problem  from  NASA.  SPD. 
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Figure  4:  mhdb416.  Magneto- hydro-dynamics  Alfven  spectral  problem.  SPD. 
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Figure  5:  nos4.  Beam  Structure  Model.  SPD. 
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Figure  6:  pde900.  Model  Problem,  Nx  =  Ny  =  30.  Non-normal. 
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5  Analysis  and  Conclusions 

Our  results  show  that  our  residual  replacement  scheme  for  CA-KSMs  is  indeed  effective,  as  has  been  observed  for 
KSMs.  For  all  test  cases,  the  Newton  and  Chebyshev  bases  converge  to  a  level  considered  to  be  backward  stable, 
with  the  2-norm  of  the  true  residual  equal  to  0(e||A||2  •  1 1 a’ 1 1 2 ) -  From  Table  1,  we  can  see  that,  using  the  Newton 
and  Chebyshev  bases,  the  number  of  residual  replacements  needed  to  maintain  stability  grows  slowly  (linearly)  with 
the  basis  size,  and  the  total  number  of  replacements  is  very  small  compared  to  the  total  number  of  iterations.  We 
expect  this  behavior  because  the  conditioning  of  these  polynomial  bases  grows  slowly  with  s  for  many  systems  when 
basis  parameters  are  chosen  appropriately  [18]. 

Additionally,  our  experimental  results  indicate  that  residual  replacements  do  not  significantly  slow  the  rate  of 
converge  for  CA-KSMs  with  the  Newton  and  Chebyshev  bases.  In  fact,  in  many  experiments,  the  CA-KSM  with 
residual  replacement  often  converges  at  a  faster  rate  than  the  CA-KSM  without  residual  replacement.  This,  combined 
with  the  observation  that  the  total  number  of  replacements  is  small  with  respect  to  the  total  number  of  iterations, 
leads  us  to  conclude  that,  using  an  appropriate  Krylov  basis,  we  can  increase  the  maximum  attainable  accuracy  of 
CA-KSMs  without  asymptotically  increasing  communication  or  computation  costs. 

For  the  monomial  basis,  however,  the  basis  condition  number  grows  exponentially  with  s  [8],  which  is  reflected 
by  a  large  number  of  replacements  in  our  results.  We  observed  that  for  all  test  matrices  except  the  last  two,  nos4 
and  pdeOOO,  the  monomial  basis  became  numerically  rank  deficient  at  some  point  during  the  algorithm  for  s  =  16 
(denoted  by  an  asterisk  in  Table  1).  Since  our  generated  Krylov  subspace  is  numerically  rank  deficient  here,  we 
expect  frequent  replacement  up  to  a  point  (depending  on  the  tolerance  parameter  e).  After  the  residual  becomes  so 
small  that  the  condition  for  replacement  is  no  longer  satisfied  (designed  with  the  goal  that  the  Lanczos  procedure  not 
be  disturbed),  replacements  will  stop.  At  this  point,  we  can’t  draw  meaningful  conclusions  about  the  behavior  of  the 
monomial  basis.  In  the  case  of  the  first  two  non-normal  matrices,  cddel  and  jpwh991,  the  method  does  not  converge 
in  this  case.  For  the  two  SPD  matrices  meshleml  and  mhdb416,  however,  the  method  does  converge  despite  the 
occurrence  of  a  numerically  rank-deficient  basis.  Whether  this  is  due  to  luck  or  some  underlying  properties  remains 
to  be  determined,  and  will  require  closer  examination  of  which  iterations  were  affected  by  a  degenerate  subspace. 
For  nos4  and  pde900,  both  numerically  full  rank  for  all  bases  at  s  =  16,  the  CA-KSM  with  residual  replacement  does 
converge,  whereas  the  CA-KSM  without  residual  replacement  does  not  converge  due  to  round-off  error. 

Much  work  remains  to  be  done  on  the  analysis  of  CA-KSMs  in  finite  precision.  In  the  immediate  future,  we 
plan  to  perform  an  analysis  of  the  replacement  scheme  chosen  in  finite  precision  to  support  our  experimental  results. 
We  also  plan  to  extend  this  analysis  to  other  CA-KSMs,  such  as  CA-BICGSTAB.  This  will  follow  the  same  general 
process  here,  modulo  a  few  extra  terms. 

We  can  also  extend  our  error  analysis  to  improve  CA-KSM  algorithms.  One  possibility  is  that  we  can  heuristically 
determine  the  maximum  s  value  we  can  use  given  e,  based  on  an  estimate  of  basis  norm  growth.  This  would  allow 
us  to  choose  an  s  value  that  is  as  large  as  possible,  without  requiring  a  large  number  of  residual  replacement  steps 
(which  limit  our  savings  in  communication).  The  error  analysis  performed  here  could  also  allow  for  dynamic  selection 
of  s;  depending  on  our  error  estimate  and  the  frequency  of  residual  replacements,  s  could  be  automatically  increased 
or  decreased  throughout  the  computation.  Many  opportunities  exist  for  future  research. 


References 

[1]  Z.  Bai,  D.  Hu,  and  L.  Reichel,  A  Newton  basis  GMRES  implementation ,  IMA  Journal  of  Numerical  Analysis 
14  (1994),  no.  4,  563. 

[2]  E.  Carson,  N.  Knight,  and  J.  Demmel,  Avoiding  communication  in  two-sided  Knjlov  subspace  methods ,  Tech, 
report,  Tech.  Report,  University  of  California  at  Berkeley,  Berkeley,  CA,  USA,  2011. 

[3]  A.T.  Chronopoulos  and  C.W.  Gear,  On  the  efficient  implementation  of  preconditioned  s-step  conjugate  gradient 
methods  on  multiprocessors  with  memory  hierarchy,  Parallel  computing  11  (1989),  no.  1,  37-53. 

[4]  A.T.  Chronopoulos  and  C.D.  Swanson,  Parallel  iterative  s-step  methods  for  unsymmetric  linear  systems,  Parallel 
Computing  22  (1996),  no.  5,  623-641. 

[5]  CW  Chronopoulos  et  ah,  S-step  iterative  methods  for  symmetric  linear  systems,  Journal  of  Computational  and 
Applied  Mathematics  25  (1989),  no.  2,  153-168. 


24 


[6]  J.  Demmel,  M.  Hoemmen,  M.  Mohiyuddin,  and  K.  Yelick,  Avoiding  communication  in  computing  Krylov  sub¬ 
spaces,  Tech,  report,  Technical  Report  UCB/EECS-2007-123,  University  of  California  Berkeley  EECS,  2007. 

[7]  D.B.  Gannon  and  J.  Van  Rosendale,  On  the  impact  of  communication  complexity  on  the  design  of  parallel 
numerical  algorithms,  Computers,  IEEE  Transactions  on  100  (1984),  no.  12,  1180-1194. 

[8]  W.  Gautschi,  The  condition  of  polynomials  in  power  form,  Math.  Comp  33  (1979),  no.  145,  343-352. 

[9]  A.  Greenbaum,  Estimating  the  attainable  accuracy  of  recursively  computed  residual  methods,  SIAM  Journal  on 
Matrix  Analysis  and  Applications  18  (1997),  535. 

[10]  A.C.  Hindmarsh  and  H.F.  Walker,  Note  on  a  Householder  implementation  of  the  GMRES  method,  Tech,  report, 
Lawrence  Livermore  National  Lab.,  CA  (USA);  Utah  State  Univ.,  Logan  (USA).  Dept,  of  Mathematics,  1986. 

[11]  M.  Hoemmen,  Communication-avoiding  Krylov  subspace  methods,  Thesis.  UC  Berkeley,  Department  of  Com¬ 
puter  Science  (2010). 

[12]  W.D.  Joubert  and  G.F.  Carey,  Parallelizable  restarted  iterative  methods  for  nonsymmetric  linear  systems.  Part 
I:  Theory,  International  Journal  of  Computer  Mathematics  44  (1992),  no.  1,  243-267. 

[13]  C.E.  Leiserson,  S.  Rao,  and  S.  Toledo,  Efficient  out-of-core  algorithms  for  linear  relaxation  using  blocking  covers, 
Journal  of  Computer  and  System  Sciences  54  (1997),  no.  2,  332-344. 

[14]  G.  Meurant  and  Z.  Strakos,  The  Lanczos  and  conjugate  gradient  algorithms  in  finite  precision  arithmetic,  Acta 
Numerics  15  (2006),  471-542. 

[15]  G.A.  Meurant,  The  Lanczos  and  conjugate  gradient  algorithms:  from  theory  to  finite  precision  computations, 
vol.  19,  Society  for  Industrial  Mathematics,  2006. 

[16]  M.  Mohiyuddin,  M.  Hoemmen,  J.  Demmel,  and  K.  Yelick,  Minimizing  communication  in  sparse  matrix  solvers, 
Proceedings  of  the  Conference  on  High  Performance  Computing  Networking,  Storage  and  Analysis,  ACM,  2009, 
p.  36. 

[17]  C.C.  Paige,  M.  Rozioznik,  and  Z.  Strakos,  Modified  Gram-Schmidt  (MGS),  least  squares,  and  backward  stability 
of  MGS-GMRES,  SIAM  Journal  on  Matrix  Analysis  and  Applications  28  (2006),  no.  1,  264. 

[18]  B.  Philippe  and  L.  Reichel,  On  the  generation  of  Krylov  subspace  bases,  Applied  Numerical  Mathematics  (2011). 

[19]  J.V.  Rosendale,  Minimizing  inner  product  data  dependencies  in  conjugate  gradient  iteration ,  (1983). 

[20]  G.L.G.  Sleijpen  and  H.A.  van  der  Vorst,  Reliable  updated  residuals  in  hybrid  Bi-CG  methods,  Computing  56 
(1996),  no.  2,  141-163. 

[21]  E.  Sturler,  A  performance  model  for  Krylov  subspace  methods  on  mesh-based  parallel  computers,  Parallel  Com¬ 
puting  22  (1996),  no.  1,  57-74. 

[22]  S.A.  Toledo,  Quantitative  performance  modeling  of  scientific  computations  and  creating  locality  in  numerical 
algorithms,  Ph.D.  thesis,  Massachusetts  Institute  of  Technology,  1995. 

[23]  C.H.  Tong  and  Q.  Ye,  Analysis  of  the  finite  precision  bi-conjugate  gradient  algorithm  for  nonsymmetric  linear 
systems,  Math.  Comp,  Citeseer,  1995. 

[24]  H.A.  Van  der  Vorst  and  Q.  Ye,  Residual  replacement  strategies  for  Knjlov  subspace  iterative  methods  for  the 
convergence  of  true  residuals,  SIAM  Journal  on  Scientific  Computing  22  (1999),  no.  3,  835-852. 

[25]  H.F.  Walker,  Implementation  of  the  GMRES  method  using  Householder  transformations,  SIAM  Journal  on 
Scientific  and  Statistical  Computing  9  (1988),  152. 


25 


Appendix  A:  Round-off  in  Matrix  Powers  Computation 

We  assume  the  polynomial  basis  is  computed  via  a  3-term  recurrence,  as  follows: 

Vk  =  fl{vk)  =  7 k(A  -  OkI)vk~ i  -  akvk-2  +  (k 

=  7 kAvk-i  ^  hkOk)vk-i  -  crkVk-2  +  Ck 

We  wish  to  find  a  bound  for  ( k : 

fl(Avk- 1)  =  Adk-i  +  4 

141  <  mAe\A\  •  I’Ofe—il 

fl{lk  •  fl(Avk-i))  =  'ykAvk-i  +  4 

|4 1  <  e{mAhk\  ■  \A\  •  |'0fc — 1 1  +  \"/kAvk-i\)  +  0(e2) 

fKakVk- 2)  =  ^4- 2  +  4 

|4|  <  e\akVk-2\ 

fl{lkOkVk-i)  =  lkOkvk-i  +  4 
|4|  <  e|(7fc4)4-i| 

fl(  fK(lkOk)Vk-l)  +  fl{akVk-2)  )  =  lkOkVk-l  +  4  +  crkvk- 2  +  4  +  4  =  7/A-4-1  +  OkVk-2  +  4 
|4|  <  e  (|7fc44fc-i  +  OkVk- 2|)  +  0(e2) 


|4|  <  e\akVk-2\+e\(lk0k)vk-i\+e(\7k9kVk-i  +  fJkVk-2\)  +  O{e2) 

<  2e  (|crfc4fc_2|  +  |(7fc4)4-i|)  +0(e2) 


fl{  fKlk  ■  fl{Avk- 1))  -  fl{  fl{{lkOk)Vk-l)  +  fl{(TkVk- 2) ) )  = 

IkAvk-i  +  4  +  IkOhVk-i  +  CTfcfifc- 2  +  4  +  4  =  IkAvk-i  +  'IkOkVk-i  +  crkVk-2  +  4 

|4|  <  e  (\jkAvk-i  -  IkOkVk-i  ~  okVk- 2I)  +  0(e2) 

|4|  <  e{mA\lk\  ■  \A\  ■  \vk-i\  +  |7fe^44-i|)  +  2e(|crfe4_2|  +  |(7fc4)4-i|) 

+e  (\jkAvk-i  ~  IkOkVk-i  ~  OkVk-2 1)  +  0(e2) 

<  e (141  +  2| (7fc6»fc)f)fe_i |  +2|crfc4_2|  +  \"fkAvk-i\  +mA\'yk\  •  \A\  ■  |4-i|) 


|Cfe|  <  e(|4|  +2|(7fc4)4-i|  +2|(Tfci)fc_2|  +  Y)kAvk-i\  +rnA \jk\  ■  \A\  •  |4-i|) 
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